Lab 3 : Clustering

Team: Preeti Swaminathan, Patrick McDevitt, Andrew Abbott, Vivek Bejugama

10-dec-2017

Github link: https://github.com/bici-sancta/mashable/tree/master/lab_03


Table of Contents

  • 1.0 Business Understanding 1:

    • Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?).
    • How will you measure the effectiveness of a good algorithm?
    • Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?
  • 2.0 Data Understanding 1 :

    • Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.
    • Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes?
    • How do you deal with these problems?
  • 3.0 Data Understanding 2 :

    • Visualize any important attributes appropriately.
    • Important: Provide an interpretation for any charts or graphs.
  • 4.0 Modeling and Evaluation 1 : Train and adjust parameters

    • 4.1. t-SNE dimensionality reduction
    • 4.2. K-Means Clustering
    • 4.3. Spectral Clustering
    • 4.4. Hierarchical Clustering
  • 5.0 Modeling and Evaluation 2 : Evaluate and Compare

    • 5.1. K-Means Clustering
    • 5.2. Spectral Clustering
    • 5.3. Hierarchical Clustering
  • 6.0 Modeling and Evaluation 3 : Visualize Results

    • 6.1. K-Means Clustering
    • 6.2. Spectral Clustering
    • 6.3. Hierarchical Clustering
  • 7.0 Modeling and Evaluation 4 : Summarize the Ramifications

    • 7.1. K-Means Clustering
    • 7.2. Spectral Clustering
    • 7.3. Hierarchical Clustering
    • 7.4. Comparative View
  • 8.0 Deployment :

    • Be critical of your performance and tell the reader how your current model might be usable by other parties.
    • Did you achieve your goals? If not, can you rein in the utility of your modeling?
    • How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?
    • How would you deploy your model for interested parties?
    • What other data should be collected?
    • How often would the model need to be updated, etc.?
  • Exceptional Work :

    • You have free rein to provide additional analyses or combine analyses.
  • Code Base :

    • All of the code needed to build and evaluate the models is provided as additional files. In developing this model, we found that the code and embedded output produced large files that were difficult to maintain in a single notebook. We split the code into logical subsets to allow independent development of the data preparation phase, the dimensionality reduction phase, and each of the clustering methods.
    • The code sequence to execute all of the analysis and images provided in this section of the report is as follows :

      • 01_create_dataset.ipynb
      • 02_tsne_perplexity_evaluation.ipynb
      • 03_kmeans_cluster.ipynb
      • 04_spectral_cluster.ipynb
      • 05_hierarchical_cluster.ipynb
      • 99_tsne_perplexity_plotter.ipynb (utility routine for plotting of t-sne vectors)
    • Any of the clustering routines (03, 04, 05) can be run independently following the 01 data prep and 02 t-SNE dimensionality reduction routines.

Note to our professor

Professor Jake,

We have organized this project differently from other submissions.
Reasons

  1. The code base for Modeling and Evaluation 1 through 4 is moved to separate files. The plots you see in this file are images saved from the code and inserted here. We had to do this because clustering with t-SNE kept changing the grouping with each execution. Our code base is ordered as described in the structure above and has been executed.

  2. We also had to split the notebook into multiple files because it was becoming too large.

  3. The table of contents above is hyperlinked. You can click on a main heading to go directly to that section. From any section you can return to the table of contents by clicking the "Return to Table of Contents" link provided under each section of the document.

Thanks,
Team


Return to Table of Contents

In [2]:
from IPython.display import Image


1.0 Business Understanding 1

Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?).

We are using the Online News Popularity dataset from the UCI Machine Learning Repository. The dataset is a collection of 61 heterogeneous features for approximately 40,000 articles published by Mashable (www.mashable.com). The features are not the articles themselves, but features extracted from each article, such as word counts, title word counts, and keyword associations. The data represents a two-year period of published articles, ending in January 2015.

We intend to perform cluster analysis on this data set.

The question from the business is to identify characteristic groupings among the published articles in order to provide product insight to the product owners. In this case, the product owners are the editorial owners of each data channel; each data channel has its own editorial content owner with editorial control over what is published within that channel. The product owners have requested a cluster analysis to develop a better understanding of the characteristics over which they have control (e.g., length of articles, visual vs. textual content volume, positive and negative sentiment, and cross-channel composition). The objective here is to provide this baseline cluster analysis; the content owners intend to conduct A-B testing within these characteristics in future months to understand whether that can attract additional or different readership. To support this objective, the specific task at hand is to provide baseline cluster definitions and to develop a deployment method so that the cluster analysis can be re-executed on an approximately monthly or bi-monthly basis. The repeated analyses will help determine whether the content owners are in fact successful in managing intended change (comparative cluster analysis against the current baseline) and whether those changes promote or detract from additional readership.


The developers of the data set have requested the citation below for any use of the data.

The data is located at https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity


Citation Request :


K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.


How will you measure the effectiveness of a good algorithm?

For the clustering analyses, we will evaluate different clustering methods and, for each method, generate results using a wide range of appropriate values for the parameters that control its outputs. Each clustering method has different parameters and, in some cases, different options for performance evaluation. In all cases, we can assess the silhouette score and the process execution time for each modeling technique. For the range of parameters evaluated, we will select the configuration that provides the highest silhouette score. The evaluation of execution time will be less definitively quantitative at this time; we will report any significant issues with extraordinarily long process times that would indicate a potential problem in deployment.

Subsequent to the clustering analyses, a comparative analysis across clustering methods will be provided. An element of success in this regard is relative consistency across clustering methods: if different methods provide similar views of the cluster characteristics, we take that as an indication of robustness and validity for the clusters identified. This comparative analysis is somewhat qualitative in nature, but for this first evaluation, and for the nature of this specific data set, we submit that it provides a sufficient basis to determine whether there is merit in further developing this cluster analysis concept. If unique characteristics are identifiable across the clusters, and if these characteristics are ostensibly controllable by the product owners for future modification, then the clustering is considered sufficiently successful for the current purposes.
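As an illustration of the silhouette-based selection described above, the following sketch scores several candidate cluster counts and keeps the best. This runs on synthetic data, not the Mashable features; the range of cluster counts and the use of KMeans here are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic stand-in for the dimensionally reduced article vectors
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# score a range of candidate cluster counts with the silhouette metric
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# keep the configuration with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The same loop structure applies to the other clustering methods; only the estimator and its controlling parameters change.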

Why does your chosen validation method make sense for this specific dataset and the stakeholders needs?

The first part of the validation involves selecting the controlling parameters for each clustering method. This will be done using generally accepted standards, i.e., maximizing the silhouette score, which establishes that the selected parameters for each cluster method produce outputs with relatively high cohesion / separation ratios for the defined clusters.

The additional element of validation relates to identifying actionable cluster definitions. The business objectives at this time are to (a) develop a baseline understanding, and (b) identify levers for content control in future months. Consistent with this objective, the current objective is satisfied provided the cluster definitions identify modifiable characteristics within each cluster.


Return to Table of Contents

2.0 Data Understanding 1

Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.

Attribute Information:

 0. url:                           URL of the article
 1. timedelta:                     Days between the article publication and the dataset acquisition
 2. n_tokens_title:                Number of words in the title
 3. n_tokens_content:              Number of words in the content
 4. n_unique_tokens:               Rate of unique words in the content
 5. n_non_stop_words:              Rate of non-stop words in the content
 6. n_non_stop_unique_tokens:      Rate of unique non-stop words in the content
 7. num_hrefs:                     Number of links
 8. num_self_hrefs:                Number of links to other articles published by Mashable
 9. num_imgs:                      Number of images
10. num_videos:                    Number of videos
11. average_token_length:          Average length of the words in the content
12. num_keywords:                  Number of keywords in the metadata
13. data_channel_is_lifestyle:     Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus:           Is data channel 'Business'?
16. data_channel_is_socmed:        Is data channel 'Social Media'?
17. data_channel_is_tech:          Is data channel 'Tech'?
18. data_channel_is_world:         Is data channel 'World'?
19. kw_min_min:                    Worst keyword (min. shares)
20. kw_max_min:                    Worst keyword (max. shares)
21. kw_avg_min:                    Worst keyword (avg. shares)
22. kw_min_max:                    Best keyword (min. shares)
23. kw_max_max:                    Best keyword (max. shares)
24. kw_avg_max:                    Best keyword (avg. shares)
25. kw_min_avg:                    Avg. keyword (min. shares)
26. kw_max_avg:                    Avg. keyword (max. shares)
27. kw_avg_avg:                    Avg. keyword (avg. shares)
28. self_reference_min_shares:     Min. shares of referenced articles in Mashable
29. self_reference_max_shares:     Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess:    Avg. shares of referenced articles in Mashable
31. weekday_is_monday:             Was the article published on a Monday?
32. weekday_is_tuesday:            Was the article published on a Tuesday?
33. weekday_is_wednesday:          Was the article published on a Wednesday?
34. weekday_is_thursday:           Was the article published on a Thursday?
35. weekday_is_friday:             Was the article published on a Friday?
36. weekday_is_saturday:           Was the article published on a Saturday?
37. weekday_is_sunday:             Was the article published on a Sunday?
38. is_weekend:                    Was the article published on the weekend?
39. LDA_00:                        Closeness to LDA topic 0
40. LDA_01:                        Closeness to LDA topic 1
41. LDA_02:                        Closeness to LDA topic 2
42. LDA_03:                        Closeness to LDA topic 3
43. LDA_04:                        Closeness to LDA topic 4
44. global_subjectivity:           Text subjectivity
45. global_sentiment_polarity:     Text sentiment polarity
46. global_rate_positive_words:    Rate of positive words in the content
47. global_rate_negative_words:    Rate of negative words in the content
48. rate_positive_words:           Rate of positive words among non-neutral tokens
49. rate_negative_words:           Rate of negative words among non-neutral tokens
50. avg_positive_polarity:         Avg. polarity of positive words
51. min_positive_polarity:         Min. polarity of positive words
52. max_positive_polarity:         Max. polarity of positive words
53. avg_negative_polarity:         Avg. polarity of negative  words
54. min_negative_polarity:         Min. polarity of negative  words
55. max_negative_polarity:         Max. polarity of negative  words
56. title_subjectivity:            Title subjectivity
57. title_sentiment_polarity:      Title polarity
58. abs_title_subjectivity:        Absolute subjectivity level
59. abs_title_sentiment_polarity:  Absolute polarity level
60. shares:                        Number of shares (target)
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.simplefilter('ignore',DeprecationWarning)
import seaborn as sns
import time
In [4]:
#Import Data from .csv file

# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# ... change directory as needed to point to local data file 
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
df = pd.read_csv('../data/OnlineNewsPopularity.csv')  


# Strip leading spaces and store all the column names to a list
df.columns = df.columns.str.strip()
#col_names = df.columns.values.tolist()

2.b Verify data quality

  • Step 1: Using the standard pandas read_csv function, all the variables are imported as data type float64. Many of the values are more logically integer (counts) or boolean (e.g., is_weekend). We will convert those fields to data types appropriate to their nature.

  • Step 2: Using the 'duplicated' function in python, we confirmed that there are no duplicated data in this data set.

  • Step 3: Using standard python functions, we can also see that there are no missing values for any of the data cells either.

  • Note: So, those two traditional questions in the data cleaning phase do not need any specific action for this data set.

  • Step 4: There are, however, outliers in some of the data columns. Based on observations from scatter plots, histograms, and the evaluation of descriptive statistics (especially skewness), we consider transforming all of the variables with high right-skewness (i.e., skewness > 1) using a log transformation. Since the purpose of this evaluation is cluster analysis, a log transform of right-skewed data may improve the distance characteristics in a way that supports the clustering analysis.

  • Step 5: Subsequent to performing the log transform, we can further evaluate for outliers in that transformed data space.
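The duplicate and missing-value checks in Steps 2 and 3 can be expressed directly in pandas; a minimal sketch on a toy frame (the real checks run against the full Mashable dataframe):

```python
import pandas as pd

# toy stand-in for the Mashable dataframe
df = pd.DataFrame({'n_tokens_title': [10, 9, 12],
                   'num_imgs': [1, 0, 4]})

n_duplicates = df.duplicated().sum()    # Step 2: count duplicated rows
n_missing = df.isnull().sum().sum()     # Step 3: count missing cells

print(n_duplicates, n_missing)
```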

Data selection: identify important features

Step 1: convert to appropriate data type

In [5]:
# Converting the data type to Integer
to_int = ['timedelta','n_tokens_title', 'n_tokens_content','num_keywords',
          'num_hrefs','num_self_hrefs', 'num_imgs', 'num_videos','shares' ]
df[to_int] = df[to_int ].astype(np.int64)

Step 2: Identify duplicates

In [6]:
# Check for duplicates
df[df.duplicated()]
Out[6]:
url timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs ... min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares

0 rows × 61 columns

From above, we confirmed that there are no duplicated data in this data set.

Step 3: Identify missing values

In [7]:
df.describe().T
Out[7]:
count mean std min 25% 50% 75% max
timedelta 39644.0 354.530471 214.163767 8.00000 164.000000 339.000000 542.000000 731.000000
n_tokens_title 39644.0 10.398749 2.114037 2.00000 9.000000 10.000000 12.000000 23.000000
n_tokens_content 39644.0 546.514731 471.107508 0.00000 246.000000 409.000000 716.000000 8474.000000
n_unique_tokens 39644.0 0.548216 3.520708 0.00000 0.470870 0.539226 0.608696 701.000000
n_non_stop_words 39644.0 0.996469 5.231231 0.00000 1.000000 1.000000 1.000000 1042.000000
n_non_stop_unique_tokens 39644.0 0.689175 3.264816 0.00000 0.625739 0.690476 0.754630 650.000000
num_hrefs 39644.0 10.883690 11.332017 0.00000 4.000000 8.000000 14.000000 304.000000
num_self_hrefs 39644.0 3.293638 3.855141 0.00000 1.000000 3.000000 4.000000 116.000000
num_imgs 39644.0 4.544143 8.309434 0.00000 1.000000 1.000000 4.000000 128.000000
num_videos 39644.0 1.249874 4.107855 0.00000 0.000000 0.000000 1.000000 91.000000
average_token_length 39644.0 4.548239 0.844406 0.00000 4.478404 4.664082 4.854839 8.041534
num_keywords 39644.0 7.223767 1.909130 1.00000 6.000000 7.000000 9.000000 10.000000
data_channel_is_lifestyle 39644.0 0.052946 0.223929 0.00000 0.000000 0.000000 0.000000 1.000000
data_channel_is_entertainment 39644.0 0.178009 0.382525 0.00000 0.000000 0.000000 0.000000 1.000000
data_channel_is_bus 39644.0 0.157855 0.364610 0.00000 0.000000 0.000000 0.000000 1.000000
data_channel_is_socmed 39644.0 0.058597 0.234871 0.00000 0.000000 0.000000 0.000000 1.000000
data_channel_is_tech 39644.0 0.185299 0.388545 0.00000 0.000000 0.000000 0.000000 1.000000
data_channel_is_world 39644.0 0.212567 0.409129 0.00000 0.000000 0.000000 0.000000 1.000000
kw_min_min 39644.0 26.106801 69.633215 -1.00000 -1.000000 -1.000000 4.000000 377.000000
kw_max_min 39644.0 1153.951682 3857.990877 0.00000 445.000000 660.000000 1000.000000 298400.000000
kw_avg_min 39644.0 312.366967 620.783887 -1.00000 141.750000 235.500000 357.000000 42827.857143
kw_min_max 39644.0 13612.354102 57986.029357 0.00000 0.000000 1400.000000 7900.000000 843300.000000
kw_max_max 39644.0 752324.066694 214502.129573 0.00000 843300.000000 843300.000000 843300.000000 843300.000000
kw_avg_max 39644.0 259281.938083 135102.247285 0.00000 172846.875000 244572.222223 330980.000000 843300.000000
kw_min_avg 39644.0 1117.146610 1137.456951 -1.00000 0.000000 1023.635611 2056.781032 3613.039820
kw_max_avg 39644.0 5657.211151 6098.871957 0.00000 3562.101631 4355.688836 6019.953968 298400.000000
kw_avg_avg 39644.0 3135.858639 1318.150397 0.00000 2382.448566 2870.074878 3600.229564 43567.659946
self_reference_min_shares 39644.0 3998.755396 19738.670516 0.00000 639.000000 1200.000000 2600.000000 843300.000000
self_reference_max_shares 39644.0 10329.212662 41027.576613 0.00000 1100.000000 2800.000000 8000.000000 843300.000000
self_reference_avg_sharess 39644.0 6401.697580 24211.332231 0.00000 981.187500 2200.000000 5200.000000 843300.000000
weekday_is_monday 39644.0 0.168020 0.373889 0.00000 0.000000 0.000000 0.000000 1.000000
weekday_is_tuesday 39644.0 0.186409 0.389441 0.00000 0.000000 0.000000 0.000000 1.000000
weekday_is_wednesday 39644.0 0.187544 0.390353 0.00000 0.000000 0.000000 0.000000 1.000000
weekday_is_thursday 39644.0 0.183306 0.386922 0.00000 0.000000 0.000000 0.000000 1.000000
weekday_is_friday 39644.0 0.143805 0.350896 0.00000 0.000000 0.000000 0.000000 1.000000
weekday_is_saturday 39644.0 0.061876 0.240933 0.00000 0.000000 0.000000 0.000000 1.000000
weekday_is_sunday 39644.0 0.069039 0.253524 0.00000 0.000000 0.000000 0.000000 1.000000
is_weekend 39644.0 0.130915 0.337312 0.00000 0.000000 0.000000 0.000000 1.000000
LDA_00 39644.0 0.184599 0.262975 0.00000 0.025051 0.033387 0.240958 0.926994
LDA_01 39644.0 0.141256 0.219707 0.00000 0.025012 0.033345 0.150831 0.925947
LDA_02 39644.0 0.216321 0.282145 0.00000 0.028571 0.040004 0.334218 0.919999
LDA_03 39644.0 0.223770 0.295191 0.00000 0.028571 0.040001 0.375763 0.926534
LDA_04 39644.0 0.234029 0.289183 0.00000 0.028574 0.040727 0.399986 0.927191
global_subjectivity 39644.0 0.443370 0.116685 0.00000 0.396167 0.453457 0.508333 1.000000
global_sentiment_polarity 39644.0 0.119309 0.096931 -0.39375 0.057757 0.119117 0.177832 0.727841
global_rate_positive_words 39644.0 0.039625 0.017429 0.00000 0.028384 0.039023 0.050279 0.155488
global_rate_negative_words 39644.0 0.016612 0.010828 0.00000 0.009615 0.015337 0.021739 0.184932
rate_positive_words 39644.0 0.682150 0.190206 0.00000 0.600000 0.710526 0.800000 1.000000
rate_negative_words 39644.0 0.287934 0.156156 0.00000 0.185185 0.280000 0.384615 1.000000
avg_positive_polarity 39644.0 0.353825 0.104542 0.00000 0.306244 0.358755 0.411428 1.000000
min_positive_polarity 39644.0 0.095446 0.071315 0.00000 0.050000 0.100000 0.100000 1.000000
max_positive_polarity 39644.0 0.756728 0.247786 0.00000 0.600000 0.800000 1.000000 1.000000
avg_negative_polarity 39644.0 -0.259524 0.127726 -1.00000 -0.328383 -0.253333 -0.186905 0.000000
min_negative_polarity 39644.0 -0.521944 0.290290 -1.00000 -0.700000 -0.500000 -0.300000 0.000000
max_negative_polarity 39644.0 -0.107500 0.095373 -1.00000 -0.125000 -0.100000 -0.050000 0.000000
title_subjectivity 39644.0 0.282353 0.324247 0.00000 0.000000 0.150000 0.500000 1.000000
title_sentiment_polarity 39644.0 0.071425 0.265450 -1.00000 0.000000 0.000000 0.150000 1.000000
abs_title_subjectivity 39644.0 0.341843 0.188791 0.00000 0.166667 0.500000 0.500000 0.500000
abs_title_sentiment_polarity 39644.0 0.156064 0.226294 0.00000 0.000000 0.000000 0.250000 1.000000
shares 39644.0 3395.380184 11626.950749 1.00000 946.000000 1400.000000 2800.000000 843300.000000

From above, we can also see that there are no missing values for any of the data cells either.

  • Note: So, those two traditional questions in the data cleaning phase do not need any specific action for this data set.

Step 4 & 5 :: Identify Outliers & transform data

  • On average, every article has been shared 3,395 times, with every article shared at least once. The maximum number of times an article was shared is 843K.
  • The average number of tokens in the title is ~10 words, ranging from 2 to 23 words with a standard deviation of 2.1 words.
  • The average number of tokens in the content is 546 words. A minimum of 0 words implies that some articles have no words in the content; these could include articles containing only videos and images.
  • Articles have on average 4.5 images and 1.2 videos, and there are also many articles with no images or videos.
In [8]:
df["shares"].describe().T
Out[8]:
count     39644.000000
mean       3395.380184
std       11626.950749
min           1.000000
25%         946.000000
50%        1400.000000
75%        2800.000000
max      843300.000000
Name: shares, dtype: float64
  • Quartiles
    1: 946
    2: 1400
    3: 2800

  • This is consistent with the stated business objective to characterize
    < 946 as Regular
    between 946 and 1400 as Good
    between 1400 and 2800 as Popular
    anything higher as Viral
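These quartile-based popularity labels can be assigned with `pd.cut`; a sketch (the boundary handling at exactly 946, 1400, and 2800 shares is an assumption of this illustration):

```python
import pandas as pd

# example share counts, one per quartile band
shares = pd.Series([500, 1000, 2000, 10000])

# bin edges follow the observed quartiles of 'shares'
labels = pd.cut(shares,
                bins=[0, 946, 1400, 2800, float('inf')],
                labels=['Regular', 'Good', 'Popular', 'Viral'])
print(list(labels))
```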

Data Selection - First evaluation

  • There are 60 columns in the original data set; we added a few additional columns based on observed opportunities (e.g., publication_date, ...) as explained above.

  • From this data set, we did a simple correlation matrix to look for variables that are highly correlated with each other that could be removed with little loss of information.

  • With that downselection, we proceeded with additional evaluation of these remaining variables.

  • We recognize that there is likely significant additional opportunity for modeling improvements with many of the remaining variables, and will look to re-expand the data set to further consider that in future work.

In [9]:
sns.set(style="white")

# Compute the correlation matrix
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 13))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})


# from example found at https://www.kaggle.com/maheshdadhich/strength-of-visualization-python-visuals-tutorial/notebook
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa49f78ef60>
Remove highly correlated features

Using the correlation matrix above along with judgement we removed the following variables:

  • 'url': parsed into new variables
  • 'timedelta': not predictive
  • 'num_self_hrefs':
  • 'n_unique_tokens':
  • 'average_token_length': high correlation with multiple other variables
  • 'kw_min_min':
  • 'LDA_03': LDAs are all related, limited to 0, 1, 2
  • 'LDA_04': LDAs are all related, limited to 0, 1, 2
  • 'global_subjectivity': high correlation with multiple other variables
  • 'min_positive_polarity'
  • 'max_positive_polarity'
  • 'min_negative_polarity'
  • 'max_negative_polarity'
  • 'global_sentiment_polarity'
  • 'n_non_stop_words'
  • 'n_non_stop_unique_tokens':
  • 'kw_max_min'
  • 'kw_avg_min'
  • 'kw_min_max'
  • 'kw_max_max'
  • 'kw_avg_max'
  • 'kw_min_avg'
  • 'kw_max_avg'
  • 'rate_negative_words'
  • 'avg_positive_polarity'
  • 'self_reference_min_shares'
  • 'weekday_is_saturday'
  • 'weekday_is_sunday'
  • 'self_reference_max_shares'
  • 'title_subjectivity'
  • 'shares'
  • 'rate_positive_words'
  • 'abs_title_sentiment_polarity'
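As a cross-check on the manual down-selection above, highly correlated pairs can also be flagged programmatically; a sketch on toy data (the 0.8 threshold and the helper name `high_corr_pairs` are assumptions for illustration, not the criterion used in this report):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (column_a, column_b, |corr|) for pairs above threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(upper.loc[a, b], 2))
            for a in upper.index for b in upper.columns
            if upper.loc[a, b] > threshold]

# toy example: y is a near-copy of x, z is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({'x': x,
                    'y': x + rng.normal(scale=0.01, size=200),
                    'z': rng.normal(size=200)})
print(high_corr_pairs(toy))
```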
In [10]:
# Classifying attributes for easier analysis
dropped_features = ['url','timedelta','num_self_hrefs','n_unique_tokens','average_token_length','kw_min_min','LDA_03','LDA_04',
                    'global_subjectivity','min_positive_polarity','max_positive_polarity',
                    'min_negative_polarity','max_negative_polarity','global_sentiment_polarity',
                    'n_non_stop_words','n_non_stop_unique_tokens','kw_max_min','kw_avg_min',
                    'kw_min_max','kw_max_max','kw_avg_max','kw_min_avg','kw_max_avg',
                    'rate_negative_words','avg_positive_polarity','self_reference_min_shares',
                    'weekday_is_saturday','weekday_is_sunday','self_reference_max_shares','title_subjectivity',
                    'shares','rate_positive_words','abs_title_sentiment_polarity']

df1 = df.drop(dropped_features, axis = 1)


# Compute the correlation matrix
corr = df1.corr()

Correlation matrix of important features

In [11]:
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 13))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa49c0af198>


Return to Table of Contents

3.0 Data Understanding 2

3.a Visualization of important attributes.
  • There are 60 columns in the original data set; we added a few additional columns based on observed opportunities (e.g., publication_date, ...) as explained above.

  • From this data set, we did a simple correlation matrix to look for variables that are highly correlated with each other that could be removed with little loss of information.

  • With that downselection, we proceeded with additional evaluation of these remaining variables.

  • We recognize that there is potentially additional opportunity for modeling improvements with some of the remaining variables, and will look to re-expand the data set to further consider that in future work.

Boxplot of important features
In [12]:
imp_features = ['n_tokens_title',
 'n_tokens_content',
 'num_hrefs',
 'num_imgs',
 'num_videos',
 'num_keywords',
 'kw_avg_avg',
 'self_reference_avg_sharess',
 'LDA_00',
 'LDA_01',
 'LDA_02',
 'global_rate_positive_words',
 'global_rate_negative_words',
 'avg_negative_polarity',
 'title_sentiment_polarity',
 'abs_title_subjectivity']

for var in imp_features:
    df1.boxplot(column = var)
    plt.show()
Histogram of important features
   This section contains histograms and cross-tabulations of several important variables, showing their distributions and the popularity rates for each group of variables.

After looking at the boxplots of these variables, it is evident that many are heavily skewed. Before they are used for prediction, a log transformation would be beneficial. The following code makes those transformations, creating new variables.

In [13]:
# ---------------------------------
# Log transform variables with high skewness
# ---------------------------------

log_features = ['n_tokens_content',
 'num_hrefs',
 'num_imgs',
 'num_videos',
 'kw_avg_avg',
 'self_reference_avg_sharess',]

df1 = df

# store min value for each column
df_mins = df1[log_features].min()

for column in log_features:
    sk = df1[column].skew()
    if(sk > 1):
        new_col_name = 'ln_' + column
        print (column, sk, new_col_name)
        if df_mins[column] > 0:
            df1[new_col_name] = np.log(df1[column])
        elif df_mins[column] == 0:
            df_tmp = df1[column] + 1
            df1[new_col_name] = np.log(df_tmp)
        else:
            print('--> Log transform not completed :', column, '!!')
n_tokens_content 2.94542193879 ln_n_tokens_content
num_hrefs 4.0134948282 ln_num_hrefs
num_imgs 3.94659584465 ln_num_imgs
num_videos 7.0195327863 ln_num_videos
kw_avg_avg 5.76017729162 ln_kw_avg_avg
self_reference_avg_sharess 17.9140933777 ln_self_reference_avg_sharess
In [14]:
plt.hist(df1['n_tokens_title'], bins = 20)
plt.xlabel('Number of title words')
plt.ylabel('Frequency')
plt.show()

First is the distribution of the number of words contained in the title. This variable is approximately normally distributed and does not require transformation.

  • n_tokens_content: 'n_tokens_content' refers to the number of words in the text of the article. It is reasonable to imagine a relationship between the length of an article and the number of shares it receives.
In [15]:
plt.hist(df1['n_tokens_content'], bins = 20)
plt.xlabel('Number of words')
plt.ylabel('Frequency')
plt.show()

plt.hist(df1['ln_n_tokens_content'], bins = 20)
plt.xlabel('Log number of words')
plt.ylabel('Frequency')
plt.show()

The distribution of the untransformed data is heavily skewed; the transformed distribution shows the improvement.
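The improvement can be quantified by comparing skewness before and after the transform; a sketch on synthetic right-skewed data (the lognormal sample is an assumption, standing in for the article word counts):

```python
import numpy as np
import pandas as pd

# right-skewed toy sample standing in for a word-count column
rng = np.random.default_rng(1)
counts = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=5000))

skew_before = counts.skew()
skew_after = np.log1p(counts).skew()   # log1p also handles zero counts safely
print(round(skew_before, 2), round(skew_after, 2))
```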

  • df_channel:
In [16]:
df_channel = df1

df_channel['data_channel'] = np.NaN
condition = df['data_channel_is_lifestyle'] == 1
df_channel.loc[condition, 'data_channel'] = 'Lifestyle'
condition = df['data_channel_is_entertainment'] == 1
df_channel.loc[condition, 'data_channel'] = 'Entertainment'
condition = df['data_channel_is_bus'] == 1
df_channel.loc[condition, 'data_channel'] = 'Business'
condition = df['data_channel_is_socmed'] == 1
df_channel.loc[condition, 'data_channel'] = 'SocMed'
condition = df['data_channel_is_tech'] == 1
df_channel.loc[condition, 'data_channel'] = 'Tech'
condition = df['data_channel_is_world'] == 1
df_channel.loc[condition, 'data_channel'] = 'World'

df_channel = df_channel.groupby(by=['data_channel'])
channel_count = df_channel['data_channel'].count()
_=channel_count.plot(kind='barh', stacked=True, color = ['blue'])
  • Global rate of positive words: Another variable which would logically seem to be important for popularity is the global rate of positive words. As a rate it lies between zero and one, and it is also approximately normally distributed, as seen in the plot below.
In [17]:
plt.hist(df1['global_rate_positive_words'], bins = 20)
plt.xlabel('Global rate of positive words')
plt.ylabel('Frequency')
plt.show()
Plots and interpretation of word counts and digital media
  • The data set has features in these 6 broad categories :
    (ref - see citation reference at beginning of this document)
    • Words
      • Number of words of the title/content
      • Average word length
      • Rate of unique/non-stop words of contents
    • Links
      • Number of links
      • Number of links to other articles in Mashable
    • Digital Media
      • Number of images/videos
    • Time
      • Day of the week/weekend
    • Keywords
      • Number of keywords
      • Worst/best/average keywords (#shares)
      • Article category
    • NLP
      • Closeness to five LDA topics
      • Title/Text polarity/subjectivity
      • Rate and polarity of positive/negative words
      • Absolute subjectivity/polarity level

As a first exploration of the relationships among the available variables, we chose to select 9 variables from 4 of the above categories that appear, on the surface, to be highly relevant to the business case objective.

In this data set, n_tokens_content refers to the number of words in the published article. Since word counts, image counts, and video counts are likely independent predictors of the eventual number of shares, we evaluated scatter plots of (1) number of tokens vs. number of images, (2) number of tokens vs. number of videos, and (3) number of images vs. number of videos to understand their respective utility in a predictive model.
These are shown in the plots below.

In [18]:
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# ... Tokens, Images, Videos plots
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import numpy as np

# ... add 1 before taking the log to avoid divide-by-zero on rows with 0 counts
df1['log_n_tokens'] = np.log(df1['n_tokens_content'] + 1)
df1['log_n_imgs'] = np.log(df1['num_imgs'] + 1)
df1['log_n_videos'] = np.log(df1['num_videos'] + 1)

plt.plot(df1.log_n_tokens, df1.log_n_imgs, label = 'Tokens-content - NumImages', linestyle = 'None', marker = 'o')
plt.xlabel('Tokens-content')
plt.ylabel('Images')
plt.title('mashable characteristics')
plt.legend()
plt.show()

plt.plot(df1.log_n_tokens, df1.log_n_videos, label = 'Tokens-content - NumVideos', linestyle = 'None', marker = 'o')
plt.xlabel('Tokens-content')
plt.ylabel('Videos')
plt.title('mashable characteristics')
plt.legend()
plt.show()

plt.plot(df1.log_n_imgs, df1.log_n_videos, label = 'NumImages - NumVideos', linestyle = 'None', marker = 'o')
plt.xlabel('Images')
plt.ylabel('Videos')
plt.title('mashable characteristics')
plt.legend()
plt.show()

We can make the following observations :

  • the data values are generally well populated, with high variability
  • there does not appear to be a strong correlation between any of these pairs of variables (potentially good news for future predictive model building)
  • we choose to represent these values in ln-transformed space, due to the very wide range of values in each variable's domain and the strong right-skewness of each distribution

Plots and interpretation of LDA_00 through LDA_04

LDA - Latent Dirichlet allocation

_(ref : https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)_

LDA (in this context) refers to a method by which a body of text can be scored relative to a vocabulary set that is identified with specific topics. A body of text that discusses Machine Learning, for instance, will use vocabulary specific to that topic, and quite different from another text which discusses do-it-yourself home repair. A text can be scored relative to the similarity of a given LDA scale and then compared among other texts for similarity or difference.
This data set includes measures for 5 LDA topics, identified here as LDA_00, LDA_01, ... and LDA_04.
Similar to the above visualization, we chose to review the relative visual correlation among the LDA scores of these articles via scatter plots, this time with each LDA score plotted against LDA_00 (vs. LDA_01, LDA_02, etc.), along with a basic histogram of each individual distribution. These are shown in the plots below.
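As a hedged illustration of how per-topic LDA scores like these can be produced (this is not the pipeline used by the data set's authors; documents, vocabulary, and topic count here are invented for the sketch):

```python
# Minimal sketch: score documents against LDA topics with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "machine learning model training data features",
    "home repair paint drywall hammer nails",
    "training a model on data with many features",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Fit a 2-topic LDA model; fit_transform() returns, per document, the
# probability mass assigned to each topic (each row sums to ~1).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
scores = lda.fit_transform(X)
print(scores.shape)  # (3, 2)
```

Each column of `scores` plays the role of one LDA_xx feature: documents on similar topics receive similar score vectors and can be compared on that basis.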

In [19]:
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# ... LDA plots
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

num_bins = 20
plt.hist(df1.LDA_00, num_bins, facecolor='blue', alpha=0.5)
plt.title('LDA 00')
plt.show()

num_bins = 20
plt.hist(df1.LDA_01, num_bins, facecolor='slateblue', alpha=0.5)
plt.title('LDA 01')
plt.show()

num_bins = 20
plt.hist(df1.LDA_02, num_bins, facecolor='mediumorchid', alpha=0.5)
plt.title('LDA 02')
plt.show()

plt.plot(df1.LDA_00, df1.LDA_01, label = 'LDA_00 - LDA_01', linestyle = 'None', marker = 'o')
plt.xlabel('LDA_00')
plt.ylabel('LDA_01')
plt.title('mashable characteristics')
plt.legend()
plt.show()

plt.plot(df1.LDA_00, df1.LDA_02, label = 'LDA_00 - LDA_02', linestyle = 'None', marker = 'o')
plt.xlabel('LDA_00')
plt.ylabel('LDA_02')
plt.title('mashable characteristics')
plt.legend()
plt.show()

We can make the following observations :

  • Each of the histograms shows a very high frequency at or near the zero axis, but also a reasonably sized portion of each population widely distributed along the 0 - 1 range of the LDA score. This is (a) expected, and (b) potentially a very positive aspect of the data set for future analysis. It is expected because a wide range of topics is published on the mashable web-site, and each range of topics is expected to have disparate LDA scores. It is potentially useful because it provides a wide-ranging set of diverse measures which may prove to be predictive in the eventual business objective evaluation.
  • Each of the scatter plots (of which only 2 are included here, as the remaining are all very similar visually) shows that there is little correlation between any two LDA scores. This is also expected (again due to the wide range of mashable topics), but provides visual confidence that the data values have good domain range and are likely sufficiently non-correlated to be useful in model building.

The data set contains an interesting feature idea for this evaluation. The links embedded in each article, prior to publication, are evaluated for their keywords and the number of social media shares associated with those embedded links. These are then scored relative to the success rates of each of the keywords on a best/avg/worst basis. In essence, the concept is to use the social media share success or failure of keywords from previously published, related articles as an estimator for a to-be-published article. Since this is an interesting, and potentially useful, feature in this data set, we explored a simple data view to visualize the relationship between kw_avg_max and self_reference_avg_sharess, to verify that the distributions are understood and that there is no dependency between the two features.

In [20]:
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# ... Keywords visualizations
# ... -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import numpy as np

num_bins = 20
plt.hist(df1.kw_avg_max, num_bins, facecolor='mediumorchid', alpha=0.5)
plt.title('kw_avg_max')
plt.show()

df1['log_self_share'] = np.log(df1['self_reference_avg_sharess'] + 1)
axes = plt.gca()
axes.set_xlim([4,14])
num_bins = 20
plt.hist(df1.log_self_share, num_bins, facecolor='slateblue', alpha=0.5)
plt.title('ln_self_reference_avg_shares')
plt.show()

axes = plt.gca()
axes.set_ylim([4,14])
plt.plot(df1.kw_avg_max, df1.log_self_share, label = 'kw_avg_max - self_ref_avg', linestyle = 'None', marker = 'o')
plt.xlabel('kw_avg_max')
plt.ylabel('ln_self_reference_avg_sharess')
plt.title('mashable characteristics')
plt.legend()
plt.show()

From above plots, we observe the following :

  • The histograms show a reasonably sized portion of each population distributed along the full domain range of each feature. This is a positive characteristic of the data set for future analysis.
  • The scatter plot shows that there is little correlation between the two features, again indicating that these features are likely sufficiently non-correlated to be useful in model building.


Return to Table of Contents


4. Modeling and Evaluation 1

Under this section we describe our exceptional work, t-SNE, together with the subsequent clustering analyses. We follow the steps below


  • 4.1 Train :
    • Perform t-SNE adjusting its parameter (perplexity)
    • Perform cluster analysis using several clustering methods (adjust parameters).
        - Non-Hierarchical: K-means, Spectral
        - Hierarchical: Linkage type Ward, Average, Complete.
  • 4.2 Eval :
    • Use internal and/or external validation measures to describe and compare the clusterings and the clusters
    • How did you determine a suitable number of clusters for each method ?
  • 4.3 Visualize :
    • Use tables/visualization to discuss the found results.
    • Explain each visualization in detail.
  • 4.4 Summarize :
    • Describe your results.
    • What findings are the most interesting and why ?

4.1. Train and Adjust parameters

The base data set from which we are starting has approximately 35 features. The data set was cleaned and pre-processed for analysis, as outlined in the prior sections of this report - missing values identified, outliers dispositioned, and all features re-scaled to standard normal distribution.

In the nominal data set some features are naturally scaled from 0 to 1 (real values), such as the Latent Dirichlet Allocation (LDA) measures, while other features range from 0 to roughly 800,000 (e.g., number of shares in the social media context). Since both dimensionality reduction and cluster analyses depend on relative magnitudes, all features were mapped to a standard normal distribution to provide even weighting of all features in the mapping / clustering processes. The binary features (e.g., is_data_channel_technology) are retained as binary 0/1 valued features and one-hot encoded to similarly support evenly distributed distance evaluations among such categorical features.
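The re-scaling step above can be sketched as follows; the toy frame and column names stand in for the cleaned Mashable features (continuous columns mapped to standard normal, binary indicators left as 0/1):

```python
# Minimal sketch of the standardization step, assuming df1 holds the
# cleaned features. Continuous columns get mean 0 / std 1; binary
# indicator columns are kept as-is.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy stand-in for the cleaned frame
df1 = pd.DataFrame({
    'kw_avg_avg': [310.0, 5200.0, 870.0, 12000.0],
    'LDA_00': [0.02, 0.85, 0.40, 0.10],
    'data_channel_is_tech': [1, 0, 0, 1],   # binary: not scaled
})

continuous = ['kw_avg_avg', 'LDA_00']
scaler = StandardScaler()
df_scaled = df1.copy()
df_scaled[continuous] = scaler.fit_transform(df1[continuous])

print(df_scaled[continuous].mean().round(6))  # ~0 for each scaled column
```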

Early efforts in which we attempted to use the cleaned data set and perform cluster analyses yielded results which did not provide straightforward interpretations of the clustering results. Visually, the cluster maps did not provide well-organized presentations of clusters and the silhouette and distortion metrics were generally disorganized as a function of the number of clusters - these metrics were not smooth functions that indicated in any clear sense an optimal or even preferred number of clusters from those analyses. Methods attempted at that point included k-means, DBSCAN, and Spectral Clustering.

t-Distributed Stochastic Neighbor Embedding

Thus, we were motivated to explore dimensionality reduction as a means to simplify the data set that we presented to the clustering algorithms. Evaluating choices for dimensionality reduction we considered Principal Components Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Between the two methods, we decided to evaluate t-SNE.

  • t-SNE is a recently developed method (2008)2 that provides a means of dimensional reduction and is becoming popular as a visualization tool for high dimensional data.
    • The method is a probabilistic method of mapping distance distributions from the high dimensional space to a lower dimensional space. In contrast to PCA, the t-SNE approach can provide differing results on successive solutions on the same data set.
    • It is somewhat computationally expensive, and the processing time can be cumbersome on large data sets. For our case, we chose to use 35% of the ~40,000 rows and 35 features. The processing time for a 2-dimensional mapping varied from 200 seconds (perplexity = 5) to 46 minutes (perplexity = 1000). This duration was a supportable compromise, achieved by sampling roughly one-third (35%) of the full data set at random. The distributions of the features in the sampled set were similar to the distributions in the nominal full data set; this data set does not have features that are heavily unbalanced, so random sampling produces samples with similar distributions.
    • Multiple runs of the t-SNE algorithm with different samples produced t-SNE maps with visual similarities, and produced very consistent measures of KL-divergence from run to run. Thus, sampling within the set did not appreciably influence the mapped data.
    • The controlling parameter in the sci-kit learn implementation of t-SNE is perplexity, which functions to set the number of nearest neighbors in the mapped space. The authors of the method suggest that perplexity values in the range of 5 to 50 are typically used. For our evaluation, we experimented with perplexity in the range of 5 to 1000. Kullback-Leibler divergence is the output measure provided in the sci-kit learn model and is recommended as a means to monitor the relative improvement of one t-SNE mapping over another as a function of perplexity. The K-L divergence acts in this case as a measure of the cross-entropy between the provided feature set and the t-SNE mapped distribution.1 In our analysis, the K-L divergence continued to decrease with increasing perplexity without reaching a demonstrated minimum, although the curve of K-L divergence vs. perplexity does exhibit an elbow with perplexity in the 200 - 400 range. In addition, the plot of processing time vs. perplexity is essentially linear; thus increasing perplexity comes at a consequent cost in processing time. Visually, the t-SNE mappings in 2-D space generate visually separable clusters and approximately the same number of visual clusters. We estimate that perplexity in the range of 75 - 400 provides reasonably consistent results for the purposes of this evaluation.
    • In any case, once computed, the t-SNE maps for all values of perplexity were saved and are available for use in the subsequent clustering applications.
    • Since the original data set was sampled at 35% of the full data set, the indices of the sampled rows were retained and written back to the saved data set of t-SNE vectors, so that the identical rows could be matched from the full data set during the evaluation completed after the clustering analyses.
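The sampling and perplexity sweep described above can be sketched as follows; the synthetic matrix, perplexity values, and variable names here are illustrative stand-ins for the actual 35% sample of the scaled feature set:

```python
# Sketch of the sampling + t-SNE sweep: draw a random sample, retain its
# indices for later re-joining, and record KL divergence and wall time
# per perplexity value (the lab swept 5 .. 1000).
import time
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X_full = rng.normal(size=(600, 10))      # stand-in for the scaled features
idx = rng.choice(len(X_full), size=int(0.35 * len(X_full)), replace=False)
X_sample = X_full[idx]                   # idx is kept to match rows later

results = {}
for perplexity in (5, 30):
    t0 = time.time()
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    X_2d = tsne.fit_transform(X_sample)
    # kl_divergence_ is the fit's final Kullback-Leibler divergence
    results[perplexity] = (tsne.kl_divergence_, time.time() - t0, X_2d)

for p, (kl, secs, _) in results.items():
    print(p, round(kl, 3), round(secs, 1))
```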

K-L Divergence and Processing Time in t-SNE mapping with increasing perplexity

In [21]:
Image("./plots/tsne/t_sne_divergence_process_time.png")
Out[21]:

Examples of 2D mapping in t-SNE space (t1 vs. t2) at various values of perplexity

In [22]:
Image("./plots/tsne/t_sne_clusters_all_together.png")
Out[22]:

4.1.2 Train and Adjust clustering algorithm

Having completed the t-SNE mapping, the next step in the process was to apply different clustering methods for evaluation of appropriate clustering results.

In our evaluation, we chose to evaluate with

  • k-means clustering
  • spectral clustering
  • hierarchical clustering

These three methods have fundamental differences, and we assessed that they provide different opportunities to identify distinct resulting cluster definitions.

Train and adjust K-Means Clustering

  • the process for implementing the k-means was as follows :
    • read into memory the stored t-SNE vectors from the previously completed t-SNE mapping, along with the reference indices
    • initiate the k-means clustering from the sci-kit learn library
    • the initialization method chosen is 'k-means++', as it can improve convergence time
  • the number of clusters was evaluated for 2 - 20 clusters
  • the results presented here correspond to the t-SNE 2-D vectors associated to perplexity value of 100
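The k-means sweep just described can be sketched as follows; synthetic 2-D points stand in for the saved t-SNE vectors (perplexity = 100), and variable names are illustrative:

```python
# Sketch of the k-means sweep over 2-20 clusters on the t-SNE 2-D vectors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# stand-in for the loaded t-SNE vectors
X_tsne = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                    for c in ((0, 0), (5, 5), (0, 5))])

labels_by_k = {}
for k in range(2, 21):
    # 'k-means++' initialization, as used in the analysis above
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    labels_by_k[k] = km.fit_predict(X_tsne)

print(len(set(labels_by_k[7])))  # 7
```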
In [23]:
Image("./plots/kmeans/kmeans_7_cluster_map.png")
Out[23]:

Train and adjust Spectral Clustering

  • the process for implementing the spectral clustering is similar to the process used for k-means :
    • read into memory the stored t-SNE vectors along with the reference indices
    • initiate the spectral clustering from the sci-kit learn library
    • the affinity method chosen is 'nearest_neighbors'
  • the number of clusters was evaluated for 2 - 20 clusters
  • the results presented here correspond to the same t-SNE 2-D vectors associated to perplexity value of 100 as was used for the k-means clustering. This provides an opportunity to do a basic comparison of the different cluster methods on the same data set with identical dimensional reduction.
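An analogous sketch for the spectral clustering step, again on synthetic stand-ins for the t-SNE vectors, with the nearest-neighbors affinity named above (the neighbor count here is an illustrative choice):

```python
# Sketch of spectral clustering on the t-SNE 2-D vectors with a
# nearest-neighbors affinity graph.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X_tsne = np.vstack([rng.normal(loc=c, scale=0.4, size=(80, 2))
                    for c in ((0, 0), (6, 6), (0, 6))])

sc = SpectralClustering(n_clusters=3, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X_tsne)
print(len(set(labels)))  # 3
```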
In [24]:
Image("./plots/spectral/spectral_8_cluster_map.png")
Out[24]:

Train and adjust Hierarchical

Under this section we will build 3 hierarchical models based on the linkage methods (Ward, Complete, Average). We will compare the behavior of the 3 models and identify any patterns within.

To identify the number of clusters needed we will use a dendrogram.

Once the models are run, we will review a summary table of silhouette value and processing time.

Steps: The process for implementing the hierarchical clustering was as follows :

  1. Read into memory the full data set
  2. Read into memory the stored t-SNE vectors along with the reference indices
  3. Conduct a join on the 2 data sets to match rows of the t-SNE vectors to their corresponding rows of the feature data
  4. Create a dendrogram to identify the optimal clusters under each method.
  5. Initiate the hierarchical clustering from the sci-kit learn library for 3 linkage methods
    • ward
    • complete
    • average
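Steps 4-5 can be sketched as follows; synthetic points stand in for the joined t-SNE data, and the file reads and joins of steps 1-3 are omitted:

```python
# Sketch: a scipy dendrogram to choose the cut, then scikit-learn
# agglomerative models for the three linkage types.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
X_tsne = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2))
                    for c in ((0, 0), (5, 5))])

Z = linkage(X_tsne, method='ward')   # linkage matrix feeding the dendrogram
# dendrogram(Z)  # uncomment inside a notebook to draw the dendrogram

models = {m: AgglomerativeClustering(n_clusters=2, linkage=m).fit(X_tsne)
          for m in ('ward', 'complete', 'average')}
for name, m in models.items():
    print(name, len(set(m.labels_)))
```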


Return to Table of Contents


5.0 Modeling and Evaluation 2

Under this section we will evaluate and compare each cluster algorithm.

We will evaluate the algorithm as below:

  • 5.1 K-means : optimal clusters, silhouette score, inertia, and processing time.
  • 5.2 Spectral : optimal clusters, silhouette score, and processing time.
  • 5.3 Hierarchical : optimal clusters using the dendrogram, silhouette score, and processing time.

5.1 Evaluate and compare K-Means Clustering

  • the results of the clustering were evaluated using the silhouette and inertia scores
  • the silhouette score provides a measure of the cohesion of the observations within their assigned cluster relative to their separation from the observations in the neighboring clusters. A higher silhouette value (to a maximum of 1) indicates a more preferred clustering. For this evaluation, we plotted the average silhouette score for each k value over the range of clusters. From the plot below, the silhouette score indicates that a clustering of 7 clusters provides the highest silhouette score.
  • the inertia score provides the sum of squared distances of each value to its assigned cluster centroid. A lower inertia score indicates lower variance within the set of clusters. As this is a continually decreasing function of the number of clusters, typical practice is to identify an 'elbow' in the inertia vs. number of clusters plot as an indicator of the optimal number of clusters. That method is used here to identify that approximately 7 clusters (or slightly more) is the appropriate range.

  • Thus, by standard measures, the appropriate choice for the number of clusters from this k-means clustering analysis is 7.
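The silhouette / inertia sweep described above can be sketched as follows; the four synthetic blobs stand in for the t-SNE vectors, so the recovered k is 4 here rather than the 7 found in the actual analysis:

```python
# Sketch: compute average silhouette score and inertia for each k, then
# pick k by the silhouette maximum (the inertia elbow is read off a plot).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(70, 2))
               for c in ((0, 0), (4, 4), (0, 4), (4, 0))])

sil, inertia = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X)
    sil[k] = silhouette_score(X, km.labels_)   # cohesion vs. separation
    inertia[k] = km.inertia_                   # within-cluster sum of squares

best_k = max(sil, key=sil.get)
print(best_k)
```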

The 7 cluster set will be generated, visualizations provided, and results analysed in subsequent sections of this report.

The 7 cluster map in the t-SNE space is the figure shown above in Section 4.1.2.

Silhouette, Inertia, and Processing Time for K-Means clustering

In [25]:
Image("./plots/kmeans/cluster_kmeans_number_of_clusters_eval.png")
Out[25]:

5.2 Evaluate Spectral Clustering

  • the results of the clustering were evaluated using the silhouette scores
  • as stated above, the silhouette score provides a measure of the cohesion of the observations within their assigned cluster relative to their separation from the observations in the neighboring clusters.

For this evaluation, we plotted the average silhouette score for each K value over the range of clusters. From the plot below, the silhouette score indicates that a clustering of 7, 8, 9, or 10 provides the highest range of silhouette scores; the local maximum is observed at 8 clusters.

Thus, by silhouette score, the appropriate choice for the number of clusters from this spectral clustering analysis is 7 to 10 clusters; we evaluate the analysis with 8 clusters, as that provides the local maximum in silhouette score.

The 8 cluster map in the t-SNE space is the figure shown above in Section 4.1.2.

Silhouette score and Processing Time for Spectral clustering

In [26]:
Image("./plots/spectral/cluster_spctrl_number_of_clusters_eval.png")
Out[26]:

5.3 Evaluate Hierarchical Clustering

Under this section we evaluate the 3 hierarchical models built with the linkage methods (Ward, Complete, Average), following the steps described above in the train-and-adjust section: read the full data set and the stored t-SNE vectors into memory, join the 2 data sets on the reference indices to match rows, create a dendrogram to identify the optimal clusters under each method, and initiate the hierarchical clustering from the sci-kit learn library for the 3 linkage types (ward, complete, average). We compare the behavior of the 3 models, identify any patterns within, and review a summary table of silhouette value and processing time.

Evaluation

  • The results presented here correspond to the t-SNE 2-D vectors associated to perplexity value of 100
  • The results of the clustering were evaluated using the silhouette scores and processing time.
  • The silhouette score provides a measure of the cohesion of the observations within their assigned cluster relative to their separation from the observations in the neighboring clusters. A higher silhouette value (to a maximum of 1) indicates a more preferred clustering.

The figures below present the dendrogram and the corresponding cluster separation visualization, as displayed in the t-SNE 2-D space, for each of the above 3 linkages.

Dendrogram and cluster formation, linkage type = Average

In [27]:
Image("./plots/hierarchical/dendrogram_average.png")
Out[27]:
In [28]:
Image("./plots/hierarchical/hc_average_eval.png")
Out[28]:

Dendrogram and cluster formation, linkage type = Ward

In [29]:
Image("./plots/hierarchical/dendrogram_ward.png")
Out[29]:
In [30]:
Image("./plots/hierarchical/hc_ward_eval.png")
Out[30]:

Dendrogram and cluster formation, linkage type = Complete

In [31]:
Image("./plots/hierarchical/dendrogram_complete.png")
Out[31]:
In [32]:
Image("./plots/hierarchical/hc_complete_eval.png")
Out[32]:

Optimal clusters for each linkage type

Looking at the dendrograms, we arrive at the following optimal cluster counts

  • Ward : 2
  • Complete: 3
  • Average: 4

We will run models for these values and create an evaluation matrix.
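The evaluation matrix can be assembled as sketched below; synthetic 2-D points stand in for the t-SNE map, and the cluster counts are the dendrogram-chosen values above (Ward 2, Complete 3, Average 4):

```python
# Sketch: silhouette score and processing time for each linkage at its
# dendrogram-chosen cluster count, collected into a summary table.
import time
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 2))
               for c in ((0, 0), (5, 5), (0, 5), (5, 0))])

chosen = {'ward': 2, 'complete': 3, 'average': 4}
rows = []
for method, k in chosen.items():
    t0 = time.time()
    labels = AgglomerativeClustering(n_clusters=k, linkage=method).fit_predict(X)
    rows.append({'linkage': method, 'n_clusters': k,
                 'silhouette': silhouette_score(X, labels),
                 'seconds': time.time() - t0})

matrix = pd.DataFrame(rows)
print(matrix)
```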

Hierarchical clustering Evaluation Matrix: Silhouette score, processing time:

In [33]:
Image("./plots/hierarchical/hc_evaluation_matrix.png")
Out[33]:

The hierarchical clustering evaluation matrix above shows an average to below-average silhouette score for all 3 models. Although Ward shows a score of 41%, it is not a good model, as it could not identify more than 2 clusters. This could be a consequence of the t-SNE approach, which may have lost distance and density information, causing closely clustered data in the t-SNE 2-D mapped space.

Processing time was good for all 3 models; this good processing time for a hierarchical model is likely a consequence of the dimension reduction, since we reduced all features to just 2 dimensions.

We will do an in-depth analysis of each feature for each linkage type further.


Return to Table of Contents


6. Modeling and Evaluation 3

Under this section, we will visualize each model independently in the t-SNE 2D space displaying the magnitude and distributions of the original features relative to the cluster labels. We will produce 3 plots for each feature to assist with that visualization.

  • Spectrum Map: the left hand map is a spectrum map of that feature onto the t-SNE 2-D vector space; the color scale represents the magnitude of each point of the feature in that space.
  • Box-Plot: the center plot is a set of box-whisker plots of the same feature, where each boxplot is associated to one of the cluster labels. In addition to the boxplots, which include the quartile definitions as the box boundaries, the population mean in each cluster is shown on the boxplot chart in (small) white text.
  • Feature distribution in each cluster: the right hand map is the representation of the cluster labels - differentiated by color region - as visualized in the t-SNE 2D space.

At the end we will plot feature importance plot to summarize how each feature is distributed under each cluster for each model.

We will visualize this section for:
6.1 K-means
6.2 Spectral
6.3 Hierarchical

  • In this report we only include these visualizations for a subset of the features available. All visualizations are available in the separate code files for this analysis and at the referenced github site.
  • For all models, all of the features were included in providing an interpretation of the relative participation of each feature in defining each cluster's contributing characteristics, even if only a subset is displayed here.

6.1 K-Means Clusters - Visualize

  • To evaluate the resulting clusters for the k = 7 solution, the following approach is taken :

    • re-join the feature data set to the cluster identification regions from the k-means analysis for comparison of the features with the mapped cluster labels
    • construct visual interpretation aid of a 3-plot set for each feature as shown in below figures. Each 3-set of plots includes the following :

      • left hand map is a spectrum map of that feature onto the t-SNE 2-D vector space, the color scale represents the magnitude of each point of the feature in that space,
      • the center plot is a set of box-whisker plots of the same feature, where each boxplot is associated to one of the cluster labels. In addition to the boxplots, which include the quartile definitions as the box boundaries, the population mean in each cluster is shown on the boxplot chart in (small) white text.
      • the right hand map is the representation of the cluster labels - differentiated by color region - as visualized in the t-SNE 2D space
      • the 3-plot set can be viewed together for each feature to understand how that feature's relative values are distributed across the clusters
      • examples from 5 of the features (LDA_00, LDA_01, LDA_02, LDA_03, and LDA_04) are shown in the following set of plots
  • These plots show the following relationships :

    • LDA_00 high values are associated with Cluster 3.
    • LDA_01 high values are associated with Cluster 6.
    • LDA_02 high values are associated with Cluster 5.
    • LDA_03 high values are associated with Cluster 2.
    • LDA_04 high values are associated with Clusters 1 & 4.
  • Similarly, an observation about relative participation of each feature in each of the clusters was identified and used in the subsequent interpretations of the clusters.
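The 3-plot construction described above can be sketched as follows; the random points, feature values, and cluster labels here are illustrative stand-ins for the real t-SNE vectors and k-means labels, and the Agg backend replaces notebook display:

```python
# Sketch of one 3-plot set for a single feature: spectrum map,
# per-cluster box-whisker plot, and cluster label map.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
t1, t2 = rng.normal(size=200), rng.normal(size=200)
feature = rng.random(200)               # e.g. LDA_00 values
labels = rng.integers(0, 7, size=200)   # cluster labels, k = 7

fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(15, 4))

s = ax0.scatter(t1, t2, c=feature, cmap='viridis', s=8)  # spectrum map
fig.colorbar(s, ax=ax0)
ax0.set_title('feature magnitude in t-SNE space')

ax1.boxplot([feature[labels == c] for c in range(7)])    # box-whisker per cluster
ax1.set_title('feature by cluster label')

ax2.scatter(t1, t2, c=labels, cmap='tab10', s=8)         # cluster regions
ax2.set_title('cluster labels in t-SNE space')

fig.savefig('three_plot_set.png')
```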


LDA_00 distributions in K-Means cluster space

  • Cluster 3 is associated to high values of LDA_00 (correlated to Business Channel)
In [34]:
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_00.png")
Out[34]:

LDA_01 distributions in K-Means cluster space

  • Cluster 6 - associated to high values of LDA_01 (correlated to Entertainment Channel)
In [35]:
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_01.png")
Out[35]:

LDA_02 distributions in K-Means cluster space

  • Cluster 5 - associated to high values of LDA_02 (correlated to World Channel)
In [36]:
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_02.png")
Out[36]:

LDA_03 distributions in K-Means cluster space

  • Cluster 2 - associated to high values of LDA_03
In [37]:
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_03.png")
Out[37]:

LDA_04 distributions in K-Means cluster space

  • Clusters 1 & 4 - associated to high values of LDA_04 (correlated to Technology Channel)
In [38]:
Image("./plots/kmeans/cluster_kmeans_3way_preplx_100__7clstrsln_LDA_04.png")
Out[38]:

The 3-plot sets above aid in identifying the association of each feature relative to the cluster regions.

To further understand the cluster relationships, an additional view is presented below. For each cluster, the mean value of each feature was determined; the standard deviation of those means and a z-score of each mean relative to the other means in that cluster were then compared. The goal was not to assess these z-scores for statistically significant differences in means, but rather to identify, in a consistent way, the relative participation of each feature in each cluster - that is, to identify the few most impactful features (both positively and negatively) in defining the cluster characteristics. The median of each cluster, or some other statistic, could also have been used for this purpose; for the clusters developed from this data set, means and medians provide essentially the same view of the major contributors to a cluster characterization.
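The per-cluster participation measure can be sketched as follows; the two feature columns and random cluster labels are illustrative stand-ins for the full feature set and the k-means labels:

```python
# Sketch: per-cluster feature means, z-scored across clusters, to rank
# each feature's relative participation in defining a cluster.
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
df = pd.DataFrame({
    'LDA_04': rng.random(300),
    'global_rate_positive_words': rng.random(300),
    'cluster': rng.integers(0, 7, size=300),
})

cluster_means = df.groupby('cluster').mean()
# z-score each column of cluster means across the 7 clusters
z = (cluster_means - cluster_means.mean()) / cluster_means.std()

# strongest positive / negative definers of cluster 1
print(z.loc[1].sort_values(ascending=False))
```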

The plots below show these distributions of means for each feature in each of the clusters developed from the above k-means application.

As an example, we can make the following observations of the clusters based on these plots (and a detailed examination of the underlying values in a data table)

  • Cluster 01
    • LDA_04 (correlated to Technology Channel) is strongest definer
    • several references to positive sentiment rank near the top contributors
    • All other LDA scores are negatively indicated in this cluster - there is a strong negative relationship to the other data channels (Business, Entertainment, World, and SocialMedia). In other words, this cluster is uniquely associated to the Technology channel and also has appreciation for article content with stronger positive sentiments
  • Cluster 04
    • LDA_04 and LDA_02 are strong participants in this cluster. LDA_04 is correlated to the Technology Channel; LDA_02 is correlated to the World Channel.
    • Measures related to referencing other articles from within the mashable ecosystem (ln_self_reference_sharess) are minimally used
    • Measures related to high use of keywords within these articles are negatively associated to this cluster (kw_min_max and kw_avg_max). Thus, we can interpret this to be a cluster defined by an association of appreciation for World and Technology related content with low reliance on, and interest in, pursuing other (even related) articles within the mashable site

Similarly, an interpretation was completed for each cluster based on the relative distribution of feature means within each cluster.

As will be shown in subsequent sections, this exercise was repeated for each of the clustering methods deployed. A synopsis of relevant clusters from the overall analysis will be presented in the summary section.

Feature importance for K-means

Relative participation of each feature in each cluster (K-Means)

In [39]:
Image("./plots/kmeans/cluster_kmeans_cluster_barplots.png")
Out[39]:

6.2 Spectral Clustering - Visualize

  • To evaluate the resulting 7 clusters from the spectral clustering, a similar approach is taken as was used above for the k-means evaluation :

    • re-join the feature data set to the cluster identification regions from the spectral clustering analysis for comparison of the features with the mapped cluster labels
    • construct visual interpretation aid of a 3-plot set for each feature as shown in below figures. Each 3-set of plots includes the following :

      • left hand map is a spectrum map of that feature onto the t-SNE 2-D vector space, the color scale represents the magnitude of each point of the feature in that space,
      • the center plot is a set of box-whisker plots of the same feature, where each boxplot is associated with one of the cluster labels. In addition to the boxplots, which include the quartile definitions as the box boundaries, the population mean in each cluster is annotated on the boxplot chart in (small) white text.
      • the right hand map is the representation of the cluster labels, differentiated by color region, of the clusters as visualized in the t-SNE 2D space
      • the 3-plot set can be viewed together for each feature to understand how that feature's relative values are distributed across the clusters, and allows the feature-to-cluster associations to be visualized directly
      • examples from 5 of the features (LDA_00, LDA_01, LDA_02, LDA_03, and LDA_04) are shown in the following set of plots
  • These plots show the following relationships :

    • LDA_00 high values are associated with Cluster 2.
    • LDA_01 high values are associated with Cluster 7.
    • LDA_02 high values are associated with Cluster 1.
    • LDA_03 high values are associated with Clusters 3 & 6.
    • LDA_04 high values are associated with Clusters 1, 4, & 5.
  • Similarly, an observation about relative participation of each feature in each of the clusters was identified and used in the subsequent interpretations of the clusters.
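The 3-plot sets described above can be sketched as below. This is a hypothetical, self-contained example on synthetic stand-ins for the t-SNE coordinates, one feature, and the cluster labels; the real plots use the lab's actual embedding and features.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

# Synthetic stand-ins (illustrative only).
rng = np.random.default_rng(1)
tsne_xy = rng.normal(size=(500, 2))                          # t-SNE 2-D map
feature = tsne_xy[:, 0] + rng.normal(scale=0.3, size=500)    # e.g. ln_LDA_00
labels = (tsne_xy[:, 0] > 0).astype(int)                     # cluster labels

fig, (ax_feat, ax_box, ax_clst) = plt.subplots(1, 3, figsize=(12, 4))

# Left: feature magnitude mapped onto the t-SNE 2-D space.
sc = ax_feat.scatter(tsne_xy[:, 0], tsne_xy[:, 1], c=feature, s=5, cmap="viridis")
fig.colorbar(sc, ax=ax_feat)

# Center: box-whisker plot of the feature per cluster, mean annotated.
groups = [feature[labels == k] for k in np.unique(labels)]
ax_box.boxplot(groups)
for i, g in enumerate(groups, start=1):
    ax_box.text(i, g.mean(), f"{g.mean():.2f}", ha="center", fontsize=6)

# Right: cluster labels as colored regions in the same t-SNE space.
ax_clst.scatter(tsne_xy[:, 0], tsne_xy[:, 1], c=labels, s=5, cmap="tab10")
fig.savefig("three_panel_sketch.png")
```

Reading the three panels side by side shows where a feature's high values sit in the map, how it distributes per cluster, and which cluster regions those values fall into.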

LDA_00 distributions in SpectralClustering cluster space

  • Cluster 2 - associated to high values of LDA_00 (correlated to Business Channel)
In [40]:
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_00.png")
Out[40]:

LDA_01 distributions in SpectralClustering cluster space

  • Cluster 7 - associated to high values of LDA_01 (correlated to Entertainment Channel)
In [41]:
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_01.png")
Out[41]:

LDA_02 distributions in SpectralClustering cluster space

  • Cluster 1 - associated to high values of LDA_02 (correlated to World Channel)
In [42]:
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_02.png")
Out[42]:

LDA_03 distributions in SpectralClustering cluster space

  • Clusters 3 & 6 - associated to high values of LDA_03
In [43]:
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_03.png")
Out[43]:

LDA_04 distributions in SpectralClustering cluster space

  • Clusters 1, 4, & 5 - associated to high values of LDA_04 (correlated to Technology Channel)
In [44]:
Image("./plots/spectral/cluster_spctrl_3way_preplx_100_ln_LDA_04.png")
Out[44]:

The 3-plot sets above aid in identifying the association of each feature relative to the cluster regions.

To further understand the cluster relationships, an additional view is presented. For each cluster, the mean value of each feature in that cluster was determined, along with the standard deviation of those means, and a z-score of each mean relative to the other means in that cluster. The goal was not to assess these z-scores for statistically significant differences in means, but rather to identify, in a consistent way, the relative participation of each feature in each cluster; that is, to identify the few most impactful features (both positively and negatively) in defining the cluster characteristics. The median of each cluster, or some other statistic, could also have been used for this purpose. For the clusters developed for this data set, means and medians provide essentially the same view of the major contributors to a cluster characterization.

The plots below show these distributions of means for each feature in each of the clusters developed from the above spectral clustering application.

As an example, we can make the following observations about the clusters based on these plots (and a detailed examination of the underlying values in a data table):

  • Cluster 01

    • LDA_04 (correlated to the Technology Channel) is the strongest definer
    • several measures of positive sentiment rank among the top contributors
    • All other LDA scores are negatively indicated in this cluster; there is a strong negative relationship to the other data channels (Business, Entertainment, World, and Social Media). In other words, this cluster is uniquely associated with the Technology channel and also shows appreciation for article content with stronger positive sentiments
  • Cluster 04

    • LDA_04 and LDA_02 are strong participants in this cluster. LDA_04 is correlated to the Technology Channel; LDA_02 is correlated to the World Channel.
    • Measures related to referencing other articles from within the mashable ecosystem (ln_self_reference_sharess) are minimally used
    • Measures related to high use of keywords within these articles are negatively associated with this cluster (kw_min_max and kw_avg_max). Thus, we can interpret this to be a cluster defined by an intersection of appreciation for World and Technology related content, with low reliance on pursuing other (even related) articles within the mashable site

Similarly, an interpretation was completed for each cluster based on the relative distribution of feature means within each cluster.

As will be shown in subsequent sections, this exercise was repeated for each of the clustering methods deployed. A synopsis of relevant clusters from the overall analysis will be presented in the summary section.


Relative participation of each feature in each cluster


In [45]:
Image("./plots/spectral/cluster_spctrl_cluster_barplots_horizontal.png")
Out[45]:

6.3 Hierarchical cluster Visualize

To evaluate the resulting clusters for each of the linkage methods :
  1. Re-join the feature data set to the cluster identification regions from each method's (ward, complete, average) analysis for comparison of the features with the mapped cluster labels
  2. Construct a visual interpretation aid of a 3-plot set for each feature as shown in the below figures. Each 3-set of plots includes the following :
    a. Left hand map is a spectrum map of that feature onto the t-SNE 2-D vector space; the color scale represents the magnitude of each point of the feature in that space,
    b. The center plot is a set of box-whisker plots of the same feature, where each boxplot is associated with one of the cluster labels. In addition to the boxplots, which include the quartile definitions as the box boundaries, the population mean in each cluster is annotated on the boxplot chart in (small) white text.
    c. The right hand map is the representation of the cluster labels, differentiated by color region, of the clusters as visualized in the t-SNE 2D space.

The 3-plot set can be viewed together for each feature to understand how that feature's relative values are distributed across the clusters, and allows the feature-to-cluster associations to be visualized directly.

Feature importance for Hierarchical clustering by linkage type:

a. ward
In [46]:
Image("plots/hierarchical/HC_ward_feature_importance.png")
Out[46]:

b. complete

In [47]:
Image("plots/hierarchical/HC_complete_feature_importance.png")
Out[47]:

c. average

In [48]:
Image("plots/hierarchical/HC_average_feature_importance.png")
Out[48]:

Based on feature importance, we pick interesting features to visualize in depth.

a. LDA_00 for all 3 linkage methods:
In [49]:
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_00.png")
Out[49]:
In [50]:
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_00.png")
Out[50]:
In [51]:
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_00.png")
Out[51]:

b. LDA_01 for all 3 linkage methods:

In [52]:
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_01.png")
Out[52]:
In [53]:
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_01.png")
Out[53]:
In [54]:
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_01.png")
Out[54]:

c. LDA_02 for all 3 linkage methods:

In [55]:
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_02.png")
Out[55]:
In [56]:
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_02.png")
Out[56]:
In [57]:
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_02.png")
Out[57]:

d. LDA_03 for all 3 linkage methods:

In [58]:
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_03.png")
Out[58]:
In [59]:
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_03.png")
Out[59]:
In [60]:
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_03.png")
Out[60]:

e. LDA_04 for all 3 linkage methods:

In [61]:
Image("./plots/hierarchical/cluster_HC_ward_3way_preplx_100__4clstrsln_LDA_04.png")
Out[61]:
In [62]:
Image("./plots/hierarchical/cluster_HC_complete_3way_preplx_100__4clstrsln_LDA_04.png")
Out[62]:
In [63]:
Image("./plots/hierarchical/cluster_HC_average_3way_preplx_100__4clstrsln_LDA_04.png")
Out[63]:

We will use these visualization plots in the following sections to summarize and conclude.


Return to Table of Contents

7.0 Modeling and Evaluation 4

In this section we will use the visualization plots to understand how features are clustered and to identify any patterns that can be seen.

We will identify which features define each of the clusters in each of the Clustering algorithms. We will compare composition of clusters across algorithms.

  • 7.1 Kmeans analysis
  • 7.2 Spectral analysis
  • 7.3 Hierarchical analysis
  • 7.4 Comparative study of all types of algorithms

7.1 K-Means clustering summary:

After review of the relative positive and negative participation of the features in the clusters, we identified some consistent themes that support a synopsis of the salient characteristics, and contrasting characteristics, of the clusters in a way that supports the business interests.

  1. Almost all clusters can be associated as having very strong participation of the Latent Dirichlet allocation measures (LDA_nn). We identified from previous analyses that these LDA scores are also highly correlated to specific Channels in the mashable network. As a starting point, then, we identify the clusters to an associated data channel based on the highest LDA participation and its corresponding data channel.
  2. It is interesting from the business perspective to understand the relative participation of images and videos in the cluster content, as that is contrasting across the clusters.
  3. Clusters can be uniquely associated to one data channel or can be cross-channel characterized. We characterize the relative strength of the alternate channels within the context of the prime channel for each cluster.
  4. There is some interest in understanding how content length contributes to each cluster's characteristics, so the measures associated with article length are included in the comparative analysis.
  5. The relationship of content to the use of other resources within the mashable ecosystem shows some measurable contrasts and is also relevant to understanding whether content supports retention of a user within the mashable site. Measures that identify 'within mashable' references are identified for each cluster.
  6. Although the cluster analysis did not identify strong associations between the clusters and relative strength of social media shares of the readers, it is interesting to identify the relative scale of shares / popularity across the clusters. This is shown not to be predictive across the clusters, but the relative trends are still characterized for reference.

The below table indicates for each cluster the relative strength (positive and negative) of these characteristics in defining a cluster composition.

In [64]:
Image("./plots/kmeans_cluster_summary.png")
Out[64]:

7.2 - Spectral Clustering summary

Similar to the above summary regarding the characterization of K-Means :

  1. Almost all clusters can be associated as having very strong participation of the Latent Dirichlet allocation measures (LDA_nn). We identified from previous analyses that these LDA scores are also highly correlated to specific Channels in the mashable network. As a starting point, then, we identify the clusters to an associated data channel based on the highest LDA participation and its corresponding data channel.
  2. It is interesting from the business perspective to understand the relative participation of images and videos in the cluster content, as that is contrasting across the clusters.
  3. Clusters can be uniquely associated to one data channel or can be cross-channel characterized. We characterize the relative strength of the alternate channels within the context of the prime channel for each cluster.
  4. There is some interest in understanding how content length contributes to each cluster's characteristics, so the measures associated with article length are included in the comparative analysis.
  5. The relationship of content to the use of other resources within the mashable ecosystem shows some measurable contrasts and is also relevant to understanding whether content supports retention of a user within the mashable site. Measures that identify 'within mashable' references are identified for each cluster.
  6. Although the cluster analysis did not identify strong associations between the clusters and relative strength of social media shares of the readers, it is interesting to identify the relative scale of shares / popularity across the clusters. This is shown not to be predictive across the clusters, but the relative trends are still characterized for reference.

The below table indicates for each cluster the relative strength (positive and negative) of these characteristics in defining a cluster composition.

In [65]:
Image("./plots/spectral_cluster_summary.png")
Out[65]:

7.3 Hierarchical clustering summary

For our analysis, we will look into each of the hierarchical models and compare the behaviour of features under each type. We will start by looking at the comparison matrix.

We arrived at the below matrix using the visualization plots.

  • An H indicates a strong impact of the feature in the cluster.
  • An L indicates a very low to null impact of the feature in that cluster.
  • A blank indicates neutral.

e.g., ln_LDA_00 for linkage type 'ward' has strong data in cluster 0 and almost no data in cluster 1.

In [66]:
Image("./plots/hierarchical/HC_comparisonMatrix.png")
Out[66]:
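A comparison matrix of this kind can be derived mechanically from the per-cluster feature-mean z-scores. The sketch below is hypothetical: the threshold of 1.0 and the feature names are assumptions for illustration, not the lab's exact cutoff or columns.

```python
import numpy as np
import pandas as pd

# Synthetic z-scores of per-cluster feature means (stand-in for the real ones).
rng = np.random.default_rng(2)
z = pd.DataFrame(rng.normal(size=(4, 3)),
                 index=[f"cluster_{k}" for k in range(4)],
                 columns=["ln_LDA_00", "ln_LDA_01", "n_imgs"])

def mark(v, thresh=1.0):
    """Map a z-score to the H / L / blank notation used in the matrix."""
    if v >= thresh:
        return "H"   # strong impact of the feature in the cluster
    if v <= -thresh:
        return "L"   # very low to null impact
    return ""        # neutral

comparison = z.apply(lambda col: col.map(mark))
print(comparison)
```

Applying the same rule per linkage method yields one matrix per model, which can then be compared side by side as in the figure above.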

Observations from the comparison matrix:

  1. LDA_00 through LDA_04: We can see that overall LDA_00 through LDA_04 are highly impactful in creating clusters for all 3 types of models. What is interesting here is LDA_04: the ward and average methods used this feature's values to distinguish clusters, but complete spread it evenly between all 3 clusters. Looking at the LDAs, we can clearly see that 'ward' makes high- and low-value clusters.

  2. data_channel (Entertainment, Social Media, Lifestyle): Among all data_channels, the Entertainment, Lifestyle, and Social Media features are in our final list of features. Entertainment has made an impact on all 3 models in forming clusters; most entertainment values have formed a strong cluster of their own. Cluster 1 of 'ward', cluster 1 of 'complete', and cluster 4 of 'average' can be called the high-entertainment cluster. Among the 3, entertainment has by far the most data, with about 7K mashable links, while Social Media and Lifestyle have only around 2K each.

  3. Numbers of images and videos: These features are particularly interesting as they behave differently for each of the models. While low video counts form a cluster in complete and average, the ward method shows no deviation. The images, however, form a cluster based on the number of images in an article: ward and complete have each formed a cluster with a high number of images. A high number of images also forms a strong bond with entertainment.

  4. Global appeal and positivity (global_subjectivity, global_rate_positive_words, rate_positive_words, max_positive_polarity): This set of features is interesting to note as it is strongly impactful in 'average' but has no impact in the other two methods. The 'average' method formed cluster 2, which could be named low global appeal and low positive words. This is the same cluster which was low on LDA_00 and LDA_04.

  5. Each method by itself:
    a. ward: Just 2 clusters, with the LDAs being major factors, along with the number of images and data_channel entertainment. Cluster 1 is high images and entertainment; there is strong visual importance here. Cluster 0 is low image count and ????

    b. complete: 3 clusters. Cluster 1 is strong on LDA_01, LDA_03, entertainment, and high image count; this is somewhat similar to cluster 1 of ward for high values. Complete forms a different cluster, cluster 2, for low values of these features, while cluster 0 is mostly LDA_02 and low values of LDA_01 and LDA_03.

    c. average: formed 4 clusters. Cluster 3 is high on LDA_01, LDA_03, images, and entertainment; this is consistent with the other methods as well. Cluster 0 is strong on LDA_00 and LDA_04 and away from entertainment and videos. Cluster 2 seems a very small cluster, mostly of LDA_03 values with low entertainment values.

Patterns:
LDA_01, LDA_03, entertainment, and images are consistently together. Clearly a certain set of vocabulary related to entertainment is found in LDA_01 and LDA_03. Putting this together with the number of shares: high-share articles under entertainment, like "McDonalds Kills Site That Advised Employees to Eat Healthy Meals" and "What to Do With Your New Xbox One", have a high count of images (more than 10), while low-share articles under entertainment, like "11 TV Characters Older Than the Class of 2018" and "Hasbro Games Show Signs of Life in iPad World", show one image or none.

Note: The patterns observed here are specific to the data we analyzed. We cannot extrapolate these behaviors to future mashable data.

7.4 Comparative summary

The below table presents the compilation of cluster characterization for all models.

For each cluster's primary defining characteristic, i.e., Data Channel Name, the relative scores of the strength of the other contributing characteristics are shown. The comparative analysis is accomplished by reading the table vertically: all of the World grouped clusters are shown in the first two columns, all of the Technology related clusters are shown in the next set of columns, etc.

The below table indicates for each cluster the relative strength (positive and negative) of these characteristics in defining a cluster composition. The table also aligns the similar composition clusters between the clusters developed by the K-Means, SpectralClustering, and hierarchical methods.

The comparisons between the clustering methods are grouped vertically in the table so that a visual comparison can be made between similarly related channels. E.g., the K-Means clusters associated to the World Data Channel are vertically aligned with the SpectralClustering cluster that is also associated to the World Data Channel.

Some observations :

  • the basic cluster characterizations as developed by k-means and spectral clustering are nearly identical in identifying positive and negative contributions from among the seven defining attributes of the clusters
  • the hierarchical clustering, although with fewer clusters defined, also shows similarity with the k-means and spectral clusters where a comparison exists
  • the secondary views of the data channels - there are two World channels from the k-means and three Technology channels from the spectral clustering - may provide some insights into relevant sub-groupings that show contrasts within a given data channel
    • as an example, within the Technology subsets, the Tech + Entertainment channel has the strongest negative participation in popularity/shares among all of the attributes identified. Since revenue is linked to social media shares, it is likely worth evaluating more deeply why this sub-cluster participates so weakly in social media shares.
  • we did identify a cluster that is not associated strongly with any one of the prime Data Channels, which we call Not Top 4. This cluster may represent the miscellaneous articles scattered among the primary channels, and tends to have a distinctive rate of image and video content. It is perhaps useful to consider whether this miscellaneous content fits within the business model of the network, if the business model is that each article be identified with a data channel so that it is purposeful and provides a cohesive organization for the readers.
In [67]:
Image("./plots/all_cluster_comparison.png")
Out[67]:

Analysis of Clusters

After review of the relative positive and negative participation of the features in the clusters, we identified that there are consistent themes that support the below conclusions :

  1. Almost all clusters can be associated as having very strong alignment to each of the prime Data Channels in the mashable network. We identify the clusters by these data channels based on the highest LDA participation that is correlated to each particular channel.
  2. The relative participation of images and videos in each cluster content is provided.
  3. The relative strength of cross-channel participation in each cluster is characterized.
  4. The content length contribution to each cluster's character is depicted.
  5. The relationship of content to the use of other resources within the mashable ecosystem shows some measurable contrasts and is also relevant to the understand if content supports retention of a user within the mashable site. Measures that identify 'within mashable' references are identified for each cluster.
  6. The cluster analysis identified that there is not a strong association between the clusters and relative strength of social media shares of the readers; nevertheless, the relative scale of shares / popularity across the clusters is included for reference.

8.0 Deployment


Usefulness

The purpose of this model is to provide a baseline characterization to product owners (data channel editorial content owners) that describes defining features of their respective data channel, and that those features are 'actionable' from the perspective of the editors. The editors plan to make controlled experimental modifications to the content in the coming months, and the purpose of this model is to define the product lines (data channels) by their clusters and the contributing constituents to each of those clusters. The information proposed in this report will form the basis for the content modification experiments.

To an extent, the primary objective of this project is satisfied: a baseline characterization of relevant clusters and features has been defined by the work above.

However, as this is the first model of this type on this data set, we recognize that there are opportunities for exploration that may produce measurable improvements if pursued.

The method that we deployed began with a dimensionality reduction using t-SNE.

  • Some positive aspects of this approach :
    • This method has the advantage that it develops visualizations which are easily displayed - we found that the individual feature magnitudes mapped onto the t-SNE vector space provided a useful and easily interpretable method for evaluating relationships among the features and their relative contributions to each of the clusters
    • The dimension reduction from 40 dimensions on 10,000 rows of data to 2 or 3 dimensions is reasonably easily accomplished
    • providing a data set of 2 or 3 dimensions greatly reduces the computational effort and aids the visual interpretation for the subsequent clustering analyses
  • However :
    • Computational time of this method can be a limiting factor on larger data sets. We found that 40,000 rows * 40 features (1.5M elements) did not converge after several hours of computational time (Intel Core i5-4210U CPU @ 1.70 GHz x 4), so downsampling to 35% of the total available data was used. Since the model is not intended to support real-time immediate results, accommodations for larger data sets can be accomplished by appropriate scheduling, but this characteristic does limit the appetite for repetitive analyses
    • t-SNE does not provide exactly the same results on successive runs with the same data set; our views indicated strong similarity in the basic cluster patterns (we did not quantify this variation, but that could be done with additional effort)
    • the t-SNE mapping does diffuse density characteristics, so information relevant to clustering was likely lost in the transformation. There is a potential that localized high density regions exist in the nominal data set (based on the basic plots reviewed during EDA, this seems likely); different analysis methods may retain that information and provide some additional insight into additional cluster complexity
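The dimension-reduction step can be sketched as below. The feature matrix here is synthetic and the sample counts are illustrative; the 35% fraction and perplexity of 100 reflect the choices described in this report.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the standardized 40-feature matrix.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 40))

# Downsample to 35% of the available rows to keep run times manageable.
frac = 0.35
idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)

# Reduce to 2 dimensions; perplexity=100 matches the plot filenames above.
tsne = TSNE(n_components=2, perplexity=100, init="pca", random_state=42)
xy = tsne.fit_transform(X[idx])
print(xy.shape)
```

Note that fixing `random_state` makes a single run reproducible, but as discussed above, different seeds still produce somewhat different maps.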

The k-means analysis that we used benefited from the t-SNE mapping, as the resulting 2D t-SNE vectors produced visually globular 2-dimensional distributions.

  • k-means is computationally efficient, and can be a good starting point for a basic clustering analysis. We found that on this data set, after the t-SNE mapping, the k-means approach provided useful results.
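A minimal sketch of this step, with synthetic globular blobs standing in for the real t-SNE embedding (7 clusters matches the analysis above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D "embedding": 7 globular blobs, as the t-SNE map roughly was.
rng = np.random.default_rng(4)
centers = rng.uniform(-10, 10, size=(7, 2))
xy = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# k-means on the 2-D coordinates; cheap because the space is low-dimensional.
km = KMeans(n_clusters=7, n_init=10, random_state=42).fit(xy)
labels = km.labels_
print(np.bincount(labels))
```

The resulting labels are what get re-joined to the original feature frame for the per-cluster mean and z-score analysis above.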

The spectral clustering analysis that we performed provided similar cluster definitions as did the k-means analysis. This may be because the t-SNE mapped vectors developed near-neighbor patterns that are generally amenable to a k-means cluster approach, and therefore some of the additional capability that can be achieved with spectral clustering was not fully realizable. Future opportunities should look to alternative dimension reduction techniques and utilize more of the complex-geometry clustering available within the spectral clustering method. More nuanced views of clusters might be developed in that way.
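The spectral clustering step can be sketched analogously. This is a hypothetical configuration on synthetic data; the nearest-neighbors affinity and neighbor count here are illustrative choices, not necessarily the exact parameters used in the lab.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Synthetic 2-D "embedding" standing in for the t-SNE map.
rng = np.random.default_rng(5)
centers = rng.uniform(-10, 10, size=(7, 2))
xy = np.vstack([c + rng.normal(scale=0.5, size=(80, 2)) for c in centers])

# Nearest-neighbors affinity keeps the affinity matrix sparse; the final
# label assignment in the spectral embedding is done with k-means.
sc = SpectralClustering(n_clusters=7, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans",
                        random_state=42)
labels = sc.fit_predict(xy)
print(len(set(labels)))
```

On already-globular t-SNE output, this tends to agree with plain k-means, which is consistent with the similarity of results noted above.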

The hierarchical clustering as deployed on this data set resulted in only 2 - 4 clusters, and in the case of the 4 cluster solution, the 4th cluster was a very small portion of the data set. So, in effect, hierarchical clustering identified 2 or 3 clusters within the whole domain. Although interesting, for the intended purposes we find that of somewhat limited usefulness. The close coupling of the variables in the rather dense space of the t-SNE map may have allowed the hierarchical clustering, using Euclidean distance, to extend widely across the space and reduce the number of clusters. If this method is used in future evaluations of this data set, some method for additional delineation of the space should be explored.
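The hierarchical step, with the three linkage methods compared in section 6.3, can be sketched as follows (synthetic 2-D data standing in for the t-SNE map; Euclidean distance throughout, as in the analysis):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic 2-D "embedding": four blobs standing in for the t-SNE map.
rng = np.random.default_rng(6)
xy = np.vstack([c + rng.normal(scale=0.5, size=(60, 2))
                for c in [(-5.0, 0.0), (0.0, 5.0), (5.0, 0.0), (0.0, -5.0)]])

# Fit each linkage method at the same cut level for comparison.
results = {}
for linkage in ("ward", "complete", "average"):
    hc = AgglomerativeClustering(n_clusters=4, linkage=linkage)
    results[linkage] = hc.fit_predict(xy)
print({k: len(set(v)) for k, v in results.items()})
```

Comparing the label vectors across linkages is what produces the per-method comparison matrix shown in section 7.3.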

Deployment method / external data support

The current vision for deployment of the model is relatively straightforward :

  • allow the content editors sufficient time to evaluate the results of this report and consult for further clarification with these results
  • the content editors to decide characteristics within the clusters that are within their control; the editorial staff can modify content to effect learning sets
  • subsequent clustering models, using the tools herein developed, can be efficiently re-deployed on mashable experiences that are to occur in the coming months
  • cluster compositional changes can be compared to this baseline model once those experiences are available
  • it is not envisioned at this time to deploy this model on a continuing basis nor in a production environment, but rather as a means to support content evaluations on targeted, specific evaluations

Model Updates

  • this model is a first edition; its usefulness requires evaluation from the product content owners to benefit from their domain expertise. It is expected that collaboration with subject matter experts will provide input for some relevant modifications to the model thus far developed
  • the model's input features are elements that have a certain temporal durability (positive and negative sentiments, domain segregation of vocabulary, hyper-link reference counts). But the fact remains that the characteristics of web-published articles are ever-evolving. Because of that, the recommendation is that, in the initial deployment phases, the model characteristics be re-evaluated on a regular basis for continued relevance. Monitoring for evolution of the model characteristic changes will provide opportunity for updated understanding that remains current.
  • longer term, the survivability of this model will depend on the ability to evaluate changes in readership habits in conjunction with intentional changes introduced by the editorial staff. This requires an additional model that has correlational capability to characteristics such as number of social media shares or article popularity. Previous efforts on this data set, using support vector machines and even logistic regression, have shown that capability to a satisfactory level. Thus, the clustering model will need support from the correlational models to develop its full benefit.


Return to Table of Contents


9. Exceptional work

For our exceptional work we have implemented the dimensionality reduction technique t-SNE.

Additionally, the visualizations of the association between cluster maps, feature magnitudes superposed on the maps, and the associated boxplot distributions provided an efficient and useful visualization to understand feature distribution within each cluster.

Thank you !!

In [ ]: